(54) Application-specific conflict resolution for weakly consistent replicated databases



(57) Write operations for weakly consistent replicated
database systems have embedded application-specific
procedures that are invoked for resolving conflicts
whenever it is found that the related write operation
conflicts with the current state of a given instance of such a
database.
Description

This invention relates to replicated weakly consistent
data storage systems and, more particularly, to an
automated technique for implementing application-specific
conflict resolution in such systems.

Replicated, weakly consistent databases are well
suited for applications involving the sharing of data
among multiple users with low speed or intermittent
communication links. As an example, these applications
can run in a mobile computing environment that includes
portable machines with less than ideal network
connectivity. A user's computer may have a wireless
communication device, such as a cell modem or packet radio
transceiver, relying on a network infrastructure that may
suffer from not being universally available and/or from
being very expensive. Such a computer may use
short-range line-of-sight communication, such as the infrared
"beaming" ports available on some commercial personal
digital assistants (PDAs). Alternatively, the computer
may have a conventional modem requiring it to be
physically connected to a phone line when sending and
receiving data, or it may only be able to communicate with
the rest of the system when inserted in a docking station.
Indeed, the computer's only communication device may
be a diskette that is transported between machines by
humans. Accordingly, it will be apparent that a mobile
computer may experience extended and sometimes
involuntary disconnection from many or all of the other
devices with which it wants to share data.

In practice, mobile users may want to share their
appointment calendars, bibliographic databases,
meeting notes, evolving design documents, news bulletin
boards, and other types of data in spite of their
intermittent network connectivity. Thus there is a need for
systems that enable mobile clients to actively read and write
shared data. Even though such a system most probably
will have to cope with both voluntary and involuntary
communication outages, it should behave from the
user's viewpoint, to the extent possible, like a centralized,
highly-available database service.

In accordance with this invention, write operations
for weakly consistent replicated database systems have
application-specific embedded procedures that are
invoked for resolving conflicts whenever it is found that
the related write operation conflicts with the current state
of a given instance of such a database. These
procedures are arbitrary procedures which are tailored to
resolve such conflicts in a manner that is calculated to
satisfy the requirements of the application that provides
them.

In one aspect of the invention, there is provided an
application-specific process for resolving conflicts that
are found to exist between an instance of a database
and write operations that are presented for updating
said instance of said database, said process comprising:
associating an arbitrary merge-procedure, specified
by said application, with each of said write operations,
where the effect of executing the merge-procedure is
deterministic for any given state of said instance of said
database at any time that said merge-procedure is
executed; and executing the merge-procedure associated
with a given write operation whenever it is found that the
given write operation conflicts with said instance of said
database, said merge-procedure producing a set of
updates that are applied to said database in lieu of any
updates originally contemplated by the given write
operation.

The present invention will now be described, by way 
of example, with reference to the accompanying draw- 
ings, in which: 

Fig. 1 is a simplified block diagram of a client/server 
architecture that may be used to carry out the 
present invention; 

Fig. 2 shows how the architecture of Fig. 1 can be 
extended to include session managers for enforcing 
selected session guarantees on behalf of the cli- 
ents; 

Fig. 3 is a flow diagram for a write execution proc- 
ess; 

Fig. 4 is a flow diagram for an application specific 
conflict detection process; 

Fig. 5 is a flow diagram for an application specific 
conflict resolution process; 

Fig. 6 is a schematic of a write log that discriminates
between committed writes and tentative writes to
identify a database of stable data ("committed
database") and an extended database that includes
potentially unstable data ("full database");

Fig. 7 is a flow diagram of a process for handling
writes received from client applications;

Fig. 8 is a flow diagram of a process for handling
writes received from another server via
anti-entropy;

Fig. 9 is a flow diagram of a process for handling
writes received from client applications by a primary
server;

Fig. 10 expands on Fig. 8 to illustrate a process for
handling writes received at a secondary server via
anti-entropy from other servers;

Fig. 11 expands on Fig. 10 to illustrate a process for
committing writes at secondary servers;

Fig. 12 illustrates a scenario of the type that causes
write re-ordering; and

Figs. 13 and 14 track the scenario shown in Fig. 12.

A. A Typical Environment 

Some computational tools, such as PDAs (Personal
Digital Assistants), have insufficient storage for holding
copies of all, or perhaps any, of the data that their users
want to access. For this reason, this invention
conveniently is implemented by systems that are architected,
as shown in Fig. 1, to divide functionality between
servers 11-13, which store data, and clients 15 and 16, which
read and write data that is managed by servers. A server
is any machine that holds a complete copy of one or
more databases. The term "database" is used loosely
herein to denote a collection of data items, regardless
of whether such data is managed as a relational
database, is simply stored in a conventional file system, or
conforms to any other data model. Clients are able to
access data residing on any server with which they can
communicate, and conversely, any machine holding a
copy of a database, including a personal laptop, is
expected to be willing to service read and write requests
from other clients.

Portable computers may be servers for some data- 
bases and clients for others. For instance, a client may 
be a server to satisfy the needs of several users who 
are disconnected from the rest of the system while ac- 
tively collaborating, such as a group of colleagues taking 
a business trip together. Rather than merely giving a 
member of this disconnected working group access to 
only the data that he had the foresight to copy to his 
personal machine, the server/client model of Fig. 1 pro- 
vides sufficient flexibility to let any group member have 
access to any data that is available in the group. 

The notion of permitting servers to reside on porta- 
ble machines is similar to the approach taken to support 
mobility in existing systems, such as Lotus Notes and 
Ficus. 

Database replication is needed to enable non-con- 
nected users to access a common database. Unfortu- 
nately, many algorithms for managing replicated data, 
such as those based on maintaining strong data con- 
sistency by atomically updating all available copies, do 
not work well in a partitioned network such as is con- 
templated for the illustrated embodiment, particularly if 
site failures cannot be reliably detected. Quorum based 
schemes, which can accommodate some types of net- 
work partitions, do not work well for disconnected indi- 
viduals or small groups. Moreover, algorithms based on 
pessimistic locking are also unattractive because they 
severely limit availability and perform poorly when mes- 
sage costs are high, as is generally the case in mobile 
environments. 

Therefore, to maximize a client's ability to read and
write data, even while completely disconnected from the
rest of the computing environment, a read-any/write-any
replication scheme is preferred. This enables a user to
read from, as at 21-23, and write to, as at 25, any copy
of the database. The timeliness with which writes will
propagate to all other replicas of the database, as at 26
and 27, cannot be guaranteed because communication
with certain of these replicas may be currently
infeasible. Thus, the replicated databases are only weakly
consistent. Techniques for managing weakly consistent
replicated data, which have gained favor not only for their
high availability but also for their scalability and
simplicity, have been employed in a variety of prior systems.

As shown in some additional detail in Fig. 2, servers
11 and 12 propagate writes, as at 26, among copies of
a typical database 30 using an "anti-entropy" protocol.
Anti-entropy ensures that all copies of a database 30
are converging towards the same state and will
eventually converge to identical states if there are no new
updates. To achieve this, the servers 11 and 12, as well as
all other servers, must not only receive all writes but
must also order them consistently.

Peer-to-peer anti-entropy is employed to ensure
that any two servers that are able to communicate will
be able to propagate updates between themselves.
Under this approach, even machines that never directly
communicate can exchange updates via intermediaries.
Each server periodically selects another server with
which to perform a pair-wise exchange of writes, as at
26, with the server selected depending on its availability
as well as on the expected costs and benefits. At the
end of this process, both servers 11 and 12 have
identical copies of the database 30, viz., at the end of the
process, the servers 11 and 12 have the same writes
effectively performed in the same order. Anti-entropy
can be structured as an incremental process so that
even servers with very intermittent or asymmetrical
connections can eventually bring their databases into a
mutually consistent state.
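
The pair-wise exchange described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `Replica`, `accept`, `ordered_log`, and `anti_entropy` are hypothetical, a write is reduced to an ID plus a payload, and write IDs are assumed to be globally unique and sortable so that every server can order the writes it holds in the same way.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A server's copy of a database, reduced to the set of writes it knows."""
    name: str
    writes: dict = field(default_factory=dict)  # write_id -> payload

    def accept(self, write_id, payload):
        # A write that is already known is ignored.
        self.writes.setdefault(write_id, payload)

    def ordered_log(self):
        # Assumption: sorting by write ID gives the same order on every server.
        return [(wid, self.writes[wid]) for wid in sorted(self.writes)]


def anti_entropy(a, b):
    """Pair-wise exchange: each server receives the writes it is missing."""
    for wid, payload in list(a.writes.items()):
        b.accept(wid, payload)
    for wid, payload in list(b.writes.items()):
        a.accept(wid, payload)
```

In this sketch two replicas that never exchange writes directly still end up with the same ordered log, provided each of them exchanges with a common intermediary often enough.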


B. Session Guarantees 

A potential disadvantage of using read-any/write-any
replication is that inconsistencies can appear within
different instances of a given database 30, even when
only a single user or application is making data
modifications. For example, a mobile client might issue a write
at one server, such as the server 12, and later issue a
read at a different server 11. The client would see
inconsistent results unless these two servers 11 and 12 had
performed anti-entropy, with one another or through a
common chain of intermediaries, sometime between the
execution of those two operations.

To alleviate these problems, session guarantees
are provided. A "session" is an abstraction for the
sequence of read and write operations performed on a
database, such as the database 30, by one or more
participants in the session during the execution of an
application. One or more of the following four guarantees can
be requested of a session manager 32 or 33 on a
per-session basis:

• Read Your Writes - during the course of a session,
read operations by the participants reflect all
previous writes by the participants.

• Monotonic Reads - successive reads by the
participants reflect a non-decreasing set of writes
throughout a session.

• Writes Follow Reads - during a session, the writes
by the participants are propagated after reads on
which they depend.

• Monotonic Writes - during a session, the writes by
the participants are propagated after writes that
logically precede them.

These guarantees can be invoked to give individual
applications a view of the database 30 that is consistent
with their own actions, even if these applications read
and write from various, potentially inconsistent servers.
Different applications have different consistency
requirements and different tolerances for inconsistent
data. For this reason, provision advantageously is made
for enabling applications to choose just the session
guarantees that they require. The main cost of
requesting session guarantees is a potential reduction in
availability because the set of servers that are sufficiently
up-to-date to meet the guarantees may be smaller than all
the available servers. Those who want more information
on these session guarantees can consult a paper of
Douglas B. Terry et al., "Session Guarantees for Weakly
Consistent Replicated Data," Proceedings International
Conference on Parallel and Distributed Information
Systems (PDIS), Austin, TX, September 1994, pp. 140-149.
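
To make the availability cost concrete, the following sketch shows one way a session manager might decide whether a server is sufficiently up-to-date to serve a read. It is an assumption for illustration, not the mechanism of the cited paper: the `Session` class, its fields, and `usable_for_read` are hypothetical names, and each server is represented only by the set of write IDs it holds. Only the two read-side guarantees are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    # Guarantees requested for this session (hypothetical flags).
    read_your_writes: bool = False
    monotonic_reads: bool = False
    # IDs of writes issued by the session's participants.
    write_set: set = field(default_factory=set)
    # IDs of writes reflected by the session's earlier reads.
    read_set: set = field(default_factory=set)


def usable_for_read(session, server_writes):
    """Return True if a server holding `server_writes` may serve a read."""
    if session.read_your_writes and not session.write_set <= server_writes:
        return False
    if session.monotonic_reads and not session.read_set <= server_writes:
        return False
    return True
```

A session that requests no guarantees can read from any server, whereas each requested guarantee can only shrink the set of usable servers.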

C. Application Specific Detection of Update Conflicts 

Because several clients may make concurrent
writes to different servers or may attempt to update
some data based on reading an out-of-date copy,
update conflicts are unavoidable in a read-any/write-any
replication scheme. These conflicts have two basic
forms: write-write conflicts, which occur when a plurality
of clients update the same data item (or sets of data
items) in incompatible ways, and read-write conflicts,
which occur when one client updates some data that is
based on reading the value of another data item that is
being concurrently updated by a second client (or,
potentially, when the read is directed at a data item that
was previously updated on a different server than the
one being read).

Version vectors or simple timestamps are popularly
used to detect write-write conflicts. Read-write conflicts,
on the other hand, can be detected by recording and
later checking an application's read-set. However, all
these techniques ignore the applications' semantics.
For example, consider a calendar manager in which
users interactively schedule meetings by selecting blocks
of time. A conflict, as viewed by the application, does
not occur merely because two users concurrently edit
the file containing the calendar data. Rather, conflicts
arise if two users schedule meetings at the same time
involving the same attendees.

Accordingly, it is more useful to detect update
conflicts in an application-specific manner. A write conflict
occurs when the state of the database differs in an
application-relevant way from the state that is expected by
a write operation. Therefore, a write operation
advantageously includes not only the data being written or
updated (i.e., the update set), but also a dependency set.
The dependency set is a collection of
application-supplied queries and their expected results. A conflict is
detected if the queries, when run at a server against its
current copy of a database, do not return the expected
results.

These actions, as well as the resolution of any
conflict that happens to be detected and the application of
any appropriate updates to the database copy on the
server that is processing the write operation, are carried
out atomically from the viewpoint of all other reads and
writes the server performs on that particular database.
For the purpose of this embodiment, the database is
assumed to be relational.

In keeping with more or less standard practices, an
update set is composed of a sequence of update
records. An update record, in turn, (a) specifies an
update operation (i.e., an insert, delete, or modify), (b)
names the database relation to which the specified
update operation is to be applied, and (c) includes a tuple
set that should be applied to the named database
relation according to the named operation. Execution of an
insert operation causes the related tuple set to be added
to the named relation. On the other hand, the delete and
modify operations examine the tuples currently in the
named relation of the database to delete or replace,
respectively, any of those tuples that match on the primary
key of any of the tuples in the specified tuple set.
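
The three update operations can be illustrated with a small in-memory model. This is a sketch under assumptions that are not in the specification: a database is a dictionary mapping relation names to dictionaries keyed by primary key, the first field of each tuple is taken to be its primary key, and `UpdateRecord` and `apply_update_set` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateRecord:
    op: str        # "insert", "delete", or "modify"
    relation: str  # name of the relation the operation applies to
    tuples: tuple  # the tuple set for this operation


def apply_update_set(db, update_set):
    """Apply a sequence of update records to `db` in order."""
    for rec in update_set:
        rel = db.setdefault(rec.relation, {})
        for t in rec.tuples:
            key = t[0]  # assumption: first field is the primary key
            if rec.op == "insert":
                rel[key] = t
            elif rec.op == "delete":
                rel.pop(key, None)   # delete only if a tuple matches the key
            elif rec.op == "modify":
                if key in rel:       # replace only tuples that match the key
                    rel[key] = t
            else:
                raise ValueError(f"unknown update operation: {rec.op}")
```

For a delete, only the key of each supplied tuple matters in this model; the remaining fields are ignored.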

A dependency set is a sequence of zero or more
dependency records, each of which contains a query to
run against the database, together with a tuple set that
specifies the "expected" result of running that query
against the database in the absence of a conflict. As
previously pointed out, a conflict is detected if any of the
queries, when run at a server against its current copy of
the database, fails to return the expected result.
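
A dependency check of this kind can be sketched as follows. The representation is an assumption made for illustration: each dependency record is a pair of a query (a Python callable that takes the database and returns tuples) and the expected tuple set, and `detect_conflict` and `bookings_for` are hypothetical names. The sketch uses a plain loop rather than a countdown index, but it reaches the same two outcomes.

```python
def detect_conflict(db, dependency_set):
    """Return True if any dependency query fails to return its expected result."""
    for query, expected in dependency_set:
        if set(query(db)) != set(expected):
            return True   # "conflict"
    return False          # "no conflict" (also the result for an empty set)


def bookings_for(room, slot):
    """Build a query returning the reservations for one room and time slot."""
    def query(db):
        return [t for t in db.get("reservations", [])
                if t[0] == room and t[1] == slot]
    return query
```

A write that wants to reserve a room only if it is free would pair `bookings_for(room, slot)` with an empty expected result.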

As shown in Fig. 3, a write operation 63 is applied
to a database, as at 41, only after it has been confirmed
at 42 that no conflict has been detected by a conflict
detection process 43. If a conflict is found to exist, its
existence is reported or steps are taken to resolve it, as
at 44.

Referring to Fig. 4, the application-specific conflict
detection process 43 runs, one after another, all
dependency queries for a particular write operation against the
current version of the database at the server executing
the write. To this end, an index K is initialized at 45 to a
value that is equal to the number of dependency queries
that are specified by the dependency set for the given
write operation. If K initializes to a "0" value, it is
concluded at 46 that there are no dependency checks and,
therefore, a "no conflict" finding is forthcoming, as at 47.
If, however, it is determined at 46 that there are one or
more dependency checks embedded in the given write
operation, the query for the first of these checks is run
against the database, as at 48, and the results it returns
are compared against the expected results of running
that particular query, as at 49. If the actual and expected
results match, the dependency check is satisfied, so the
index K is re-evaluated at 51 to determine whether there
are any additional dependency checks to be performed.
If so, the index K is decremented at 52, and the next
dependency check is performed at 48 and 49.

If it is found at 49 that the actual results returned by
the database in response to any of the dependency
queries fail to match the expected results, the conflict
detection process 43 (Fig. 3) is brought to a conclusion at
53 in a "conflict" state. On the other hand, if all of the
dependency checks for a given write operation are
satisfied, the conflict detection process is brought to a
conclusion at 54 in a "no conflict" state.

As will be evident, dependency sets can provide
traditional optimistic concurrency control by having the
dependency queries check the version stamps of any data
that was read and on which the proposed update
depends. However, the dependency checking mechanism
is more general. For example, dependency checking
permits "blind" writes where a client does not have
access to any copy of the database yet wishes to inject a
database update assuming that some condition holds.
For instance, a client may wish to use a laptop computer
to schedule a meeting in a particular room, assuming
that the room is free at the desired time, even though
the client does not currently have access to a copy of
the room's calendar. In this case the write operation that
tries to update the meeting room calendar to reserve the
room would include a dependency query that would be
run prior to the execution of the write operation by a
server to determine if the room is free during the time slot
specified for the meeting.

D. Application Specific Resolution of Update Conflicts

Advantageously, the system not only detects
update conflicts, but also resolves any detected conflicts.
One approach to conflict resolution that is often taken
in database systems with optimistic concurrency control
is to simply abort each conflicting transaction. Other
systems rely on humans for resolving conflicts as they are
detected. Human resolution, however, is disadvantaged
in a mobile computing environment because a user may
submit an update to some server and then disconnect
while the write is propagating in the background via
anti-entropy. Consequently, at the time a write conflict is
detected (i.e., when a dependency check fails) the user
may be inaccessible.

In the illustrated embodiment, provision is made to
allow writes to specify how to resolve conflicts
automatically based on the premise that there are a significant
number of applications for which the order of
concurrently issued write operations is either not a problem or
can be suitably dealt with in an application-specific
manner at each server maintaining a copy of a database. To
carry out this conflict resolution process, as shown in
Fig. 5, each write operation includes an
application-specific procedure, called a "mergeproc" (merge
procedure), that is invoked when a write conflict is detected,
as at 53 (also see Fig. 4). This procedure reads the
database copy residing at the executing server and
resolves the conflict by producing, as at 56, an alternate
set of updates that are appropriate for the current
database contents, as at 57.

The revised update set produced by the execution
of a mergeproc may consist of a new set of tuples to be
applied to the database, a null set of tuples (i.e., nothing
should be applied), a set of one or more tuples to be
applied to a special error log relation in the database, or
a combination of the above.

Mergeprocs resemble mobile agents in that they
originate at clients, are passed to servers, and are
executed in a protected environment, as at 58, so that they
cannot adversely impact the server's operation.
However, unlike more general agents, they can only read and
write a server's database. A mergeproc's execution
must be a deterministic function of the database
contents and the mergeproc's static data.

Typically, to provide a "protected environment" for
executing these mergeprocs, each of the mergeprocs is
a function that is written in a suitable language, such as
Tcl, to run in a newly created interpreter in the address
space of the server executing the mergeproc. The
interpreter exits after it has run the mergeproc.

Mergeproc functions take no input parameters, but
they produce a new update set as their output. More
particularly, mergeprocs can invoke and receive the results
of read-only database queries against the current state
of the database. Other than this, however, they cannot
obtain information about their surroundings and cannot
affect their surroundings (other than by returning the
update set they produce). In particular, they cannot inquire
about non-deterministic variables, such as the current
time or the states of various other resources of the
server or the host it runs on, because such inquiries could
produce non-deterministic results. Suitably, these
restrictions are enforced by modifying the Tcl interpreter
that is used to disallow prohibited operations. Such
"safe" interpreters are well-known to practitioners of the
art.

It is noted that automatic resolution of concurrent
updates to file directories has been proposed for some
time and is now being employed in systems like Ficus
and Coda. These systems have recently added support
for application-specific resolution procedures, similar to
mergeprocs, that are registered with servers and are
invoked automatically when conflicts arise. However, in
these existing systems the appropriate resolution
procedure to invoke is chosen based on file properties such
as the type of the file being updated. Mergeprocs are
more flexible because they may be customized for each
write operation based on the semantics of the
application and on the intended effect of the specific write. For
example, in the aforementioned calendar application, a
mergeproc may include a list of alternate meeting times
to be tried if the first choice is already taken.

In summary, in the instant system a write operation
consists of a proposed update, a dependency set and
a mergeproc. The dependency set and mergeproc are
both dictated by an application's semantics and may
vary for each write operation issued by the application.
The verification of the dependency check, the execution
of the mergeproc, and the application of the update set
are done atomically with respect to other database
accesses on the server.
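
The three parts of a write operation and their atomic execution can be sketched as below. This is an illustration only, under several assumptions: the mergeproc is modelled as a Python callable that is handed the database (a real system would run it in a restricted interpreter and give it read-only query access), a lock stands in for atomicity, the first field of each tuple is its primary key, and `Write`, `Server`, and `make_reservation` are hypothetical names. The calendar example follows the alternate-meeting-times idea mentioned above.

```python
import threading
from dataclasses import dataclass
from typing import Callable


@dataclass
class Write:
    update_set: list      # list of (op, relation, tuple)
    dependency_set: list  # list of (query, expected_result)
    mergeproc: Callable   # db -> alternate update set


class Server:
    def __init__(self):
        self.db = {}                   # relation -> {primary_key: tuple}
        self._lock = threading.Lock()  # stands in for atomic execution

    def _apply(self, update_set):
        for op, relation, t in update_set:
            rel = self.db.setdefault(relation, {})
            key = t[0]  # assumption: first field is the primary key
            if op == "insert":
                rel[key] = t
            elif op == "delete":
                rel.pop(key, None)
            elif op == "modify" and key in rel:
                rel[key] = t

    def execute(self, write):
        """Check dependencies, run the mergeproc on conflict, apply updates."""
        with self._lock:
            conflict = any(set(q(self.db)) != set(expected)
                           for q, expected in write.dependency_set)
            updates = write.mergeproc(self.db) if conflict else write.update_set
            self._apply(updates)
            return conflict


def make_reservation(room, slots, who):
    """Build a write that books slots[0], with the rest as alternates."""
    first = slots[0]

    def query(db):
        rel = db.get("reservations", {})
        return [rel[(room, first)]] if (room, first) in rel else []

    def mergeproc(db):
        rel = db.get("reservations", {})
        for slot in slots[1:]:
            if (room, slot) not in rel:
                return [("insert", "reservations", ((room, slot), who))]
        # No alternate is free: record the failure in an error log relation.
        return [("insert", "errors", ((room, first, who), "no free slot"))]

    return Write(
        update_set=[("insert", "reservations", ((room, first), who))],
        dependency_set=[(query, [])],  # expected result: slot not yet booked
        mergeproc=mergeproc,
    )
```

Because the mergeproc here depends only on the database contents and on the data captured when the write was built, running the same write against the same database state always yields the same update set.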

E. Stabilizing Writes 

The weak consistency of the replicated databases
that this system envisions means that a write operation
may produce the desired update at one server but be
detected as a conflict at another server, thereby
producing a completely different update as the result of
executing its mergeproc. Also, a write's mergeproc may
produce different results at different servers because the
execution of the mergeproc may depend on the current
database state. Specifically, varying results can be
produced if the servers have seen different sets of previous
writes or if they process writes in different orders.

To achieve eventual consistency, servers must not
only receive all writes, but must also agree on the order
in which they apply these writes to their databases. As
will be seen, some writes obtained via anti-entropy may
need to be ordered before other writes that were
previously obtained, and may therefore cause previous
writes to be undone and reapplied to the server's
database copy. Notice that reapplying a write may cause it
to update the database in a way that differs from the
update produced by its previous execution.

A write is deemed to be "stabilized" when its effects
on the database are permanent, that is, when it will
never be undone and re-executed in the future. One way to
detect stability of a given write is to gather enough
information about each server to determine that no other
writes exist, and that no other write will be accepted in the
future, that might be ordered prior to the given write.
Unfortunately, the rate at which writes stabilize in this
fashion would depend on the rate at which anti-entropy
propagates information among all servers. For example, a
server that is disconnected for extended periods of time
could significantly delay stabilization and might cause a
large number of writes to be rolled back later.

As indicated by the schematic of the write log 60 in
Fig. 6, the illustrated embodiment includes the notion of
explicitly "committing" a write. Once a write is
committed, its order with respect to all other committed writes
is fixed and no un-committed writes will be ordered
before it, and thus its outcome will be stable. A write that
has not yet been committed is called "tentative".

A client can inquire as to whether a given write is
committed or tentative. The illustrated system allows
clients to read tentative data, if they want to do so.
However, those applications that are unprepared to deal with
tentative data and its inherent instability may limit their
requests to only return committed data. This choice is
similar to the strict and loose read operations that have
been implemented by others. Essentially, each server
maintains two views of the database: a copy that only
reflects committed data, and another "full" copy that also
reflects the tentative writes currently known to the
server. The full copy is an estimation of what the database
will contain when the tentative writes reach the primary
server.

One way to commit a write would be to run some
sort of consensus protocol among a majority of servers.
However, such protocols do not work well for the types
of network partitions that occur among mobile
computers.

Instead, in the instant system, each database has
one distinguished server, the "primary", which is
responsible for committing writes to that database. The other,
"secondary" servers tentatively accept writes and
propagate them toward the primary using anti-entropy. After
secondary servers communicate with the primary, and
propagate their tentative writes to it, the primary
converts these writes to committed writes, and a stable
commit order is chosen for those writes by the primary
server. Knowledge of committed writes and their
ordering propagates from the primary back to the
secondaries, again via anti-entropy. The existence of a primary
server enables writes to commit even if other secondary
servers remain disconnected. In many cases, the
primary may be placed near the locus of update activity for
a database, thereby allowing writes to commit as soon
as possible.

More particularly, for stabilizing writes through the
use of an explicit commit process of the foregoing type,
write operations that a server accepts from a client
application are handled differently than those that are
received from another server. As shown in Fig. 7, writes
received from a client are first assigned a unique ID, as
at 61. Unique IDs are chosen by each server in such a
way that a new write always gets ordered at the end of
the server's write log. Thereafter, the write is appended,
as at 62, to the tail or "young" end of the write log 60
(Fig. 6) within the server for the database to which the
write is directed. Further, the write is executed, as at 63,
to update the current state of the database.

On the other hand, as shown in Fig. 8, when a new
write (i.e., a write not already in the write log 60 as
determined at 64) is received from another server via
anti-entropy, the write is not necessarily appended to the
young end of the write log 60. Instead, a sort key is
employed to insert the write into the write log in a sorted
order, as at 65. A commit sequence number (CSN) is
used as the sort key for ordering committed writes, while
the write ID is used as the sort key for ordering tentative
writes. These sort keys and the way they are assigned
to the writes are described in more detail hereinbelow.
At this point, however, it should be understood that both
the tentative writes and the committed writes are
consistently ordered within those two different
classifications at all servers that have the writes or any subset of
them. However, the reclassification of a write that occurs
when a server learns that one of its tentative writes has
been committed can cause that write to be reordered
relative to one or more of the other tentative writes
because a different sort key is used for the write once it is
committed. As will be seen, steps preferably are taken
to reduce the frequency and magnitude of the
re-ordering that is required because of the computational cost
of performing the re-ordering, but some re-ordering still
should be anticipated.
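
The effect of using two different sort keys can be seen in a short sketch. It assumes, for illustration only, that a write is a dictionary with a hypothetical `write_id` field (any sortable value) and a `csn` field that is `None` until the write is committed; committed writes are placed ahead of tentative ones, consistent with the statement that no un-committed write is ordered before a committed one.

```python
def sort_key(write):
    """Committed writes sort by CSN; tentative writes sort after them by write ID."""
    if write["csn"] is not None:
        return (0, write["csn"])
    return (1, write["write_id"])


def log_order(writes):
    """Return the names of the writes in write-log order."""
    return [w["name"] for w in sorted(writes, key=sort_key)]
```

With two tentative writes ordered `w1, w2` by write ID, committing `w2` first moves it ahead of `w1`, which is the kind of re-ordering described above.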

Whenever a server inserts a write that was received 
from another server into its write log at 65, the server 
determines at 66 whether the write is being inserted at 
the young end of the log 60 or at some other position 
therein. If it is found that the write simply is being
appended to the young end of the log, the write is executed
at 63 to update the current state of the database (see
Fig. 3). Conversely, if the write sorts into any other
position in the write log 60, a rollback procedure is invoked,
as at 68, for "rolling back" the database to a state
corresponding to the position at which the new write is
inserted in the write log 60 and for then sequentially
re-executing, in sorted order, all writes that are located in
the write log between the insert position for the new write
and the young end of the log 60.
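
The decision made at 64, 65, 66 and 68 might be sketched as follows. The sketch makes several assumptions that are not in the text: the database is a plain dictionary, each log entry is a hypothetical `(sort_key, (name, value))` pair, and the rollback at 68 is modelled by replaying the whole log from an empty database, which gives the same final state as undoing back to the insert position and re-executing from there.

```python
import bisect

def replay(log):
    """Rebuild the database by executing every logged write in sorted order."""
    db = {}
    for _sort_key, (name, value) in log:
        db[name] = value
    return db

def receive_from_server(log, db, entry):
    """Handle a write received via anti-entropy.

    Returns the resulting database and a flag saying whether a rollback
    (modelled here as a full replay) was needed.
    """
    if entry in log:                     # 64: not a new write, nothing to do
        return db, False
    pos = bisect.bisect(log, entry)      # 65: position dictated by the sort key
    log.insert(pos, entry)
    if pos == len(log) - 1:              # 66: landed at the young end
        name, value = entry[1]
        db[name] = value                 # 63: simply execute the write
        return db, False
    return replay(log), True             # 68: roll back and re-execute

log, db = [], {}
db, rb1 = receive_from_server(log, db, ((10, "S1"), ("room", "alice")))
db, rb2 = receive_from_server(log, db, ((30, "S2"), ("room", "carol")))
db, rb3 = receive_from_server(log, db, ((20, "S3"), ("room", "bob")))
```

Only the third write sorts into the middle of the log, so only that call reports a rollback.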

As previously mentioned, a write is stable only after 
it is committed. Moreover, once a write is committed, it 
never again has to be executed. Thus, a server need 
only have provision for identifying which writes have 
been committed, and need not fully store the write op- 
eration that it knows to be committed. Accordingly, some 
storage capacity savings may be realized.

It was already pointed out that each database relies
on just one server at a time (the "primary server") for 
committing writes to ensure that there is a consistent 
ordering of all the committed writes for any given data- 
base. This primary server commits each of these writes 
when it first receives it (i.e., whether the write is received 
from a client application or another server), and the com- 
mitted state of the write then is propagated to all other 
servers by anti-entropy. Fig. 9 adequately illustrates the 
behavior of a primary server when it receives a write 
from a client. Each write that the primary server receives 
from a client is assigned a unique write ID, as at 61, plus
the next available CSN in standard counting order, as
at 69. Thereafter, the write is appended to the tail of the
write log (the log contains only committed writes), as
at 70, and the write is executed, as at 63.
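
A primary server's handling of a client write could look roughly like the sketch below. The class name, the dictionary used as the database, and the counter standing in for a timestamp source are all assumptions made for illustration. The write ID is stored as a `(timestamp, server ID)` pair purely for convenience.

```python
import itertools

class PrimaryServer:
    """Hypothetical primary server that commits each write on receipt."""

    def __init__(self, server_id):
        self.server_id = server_id
        self.clock = itertools.count(1)   # stand-in for a real timestamp source
        self.next_csn = 1
        self.log = []                     # committed writes only, in CSN order
        self.db = {}

    def accept_from_client(self, name, value):
        write_id = (next(self.clock), self.server_id)   # 61: unique write ID
        csn = self.next_csn                             # 69: next CSN in counting order
        self.next_csn += 1
        self.log.append((csn, write_id, name, value))   # 70: append to the tail
        self.db[name] = value                           # 63: execute the write
        return csn

primary = PrimaryServer("P")
first_csn = primary.accept_from_client("room", "alice")
second_csn = primary.accept_from_client("desk", "bob")
```

Because every write is committed as it arrives, the log never needs re-sorting at the primary and no rollback is triggered there.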

As shown in Fig. 10, writes a secondary server re- 
ceives from other servers via anti-entropy are examined 
at 90 to determine whether they are in the appropriate 
location in the write log 60 for that server. If so, the write 
is ignored, as at 91. Otherwise, however, the write is
further processed at 92 in accordance with Fig. 8 to
determine whether it is a new write and, if so, to insert it into
the appropriate tentative location in the server's write
log 60 and to apply it to the full database. Moreover, the
write also is examined at 93 to determine whether it has
been committed by the primary server. If it is found at
93 that the write has an apparently valid CSN, a process
is invoked at 94 for committing the write at the secondary
server and for re-executing it and all tentative writes if
the committing of the write causes it to be re-ordered.

Referring to Fig. 11, while committing a write
received from another server, a secondary server
removes any prior record that it has of the write from its
tentative writes, as at 71, and appends the write to the
young end of the committed write portion of its write log,
as at 72. If it is determined at 73 that the ordering of the
write in the write log 60 is unaffected by this
reclassification process, no further action is required. If, however,
the reclassification alters the ordering of the write, the
database is rolled back as at 74 to a state corresponding
to the new position of the write in the write log 60, and
all writes between that position and the young end of the
tentative portion of the write log 60 are re-executed, as
at 63.
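
The reclassification test at 71 to 73 can be illustrated with a small sketch. The two-list representation of the write log (a committed portion followed by a tentative portion) and the function name are hypothetical, and the sketch assumes that committed writes reach the secondary server in CSN order.

```python
def commit_at_secondary(committed, tentative, write_id, csn):
    """Reclassify `write_id` as committed; return True if its position moved.

    `committed` is a list of (csn, write_id) pairs in CSN order and
    `tentative` is a list of write IDs in tentative order.
    """
    old_order = [wid for _csn, wid in committed] + tentative
    if write_id in tentative:
        tentative.remove(write_id)         # 71: drop the prior tentative record
    committed.append((csn, write_id))      # 72: young end of the committed portion
    new_order = [wid for _csn, wid in committed] + tentative
    return new_order != old_order          # 73: was the write re-ordered?

committed = [(1, "W0")]
tentative = ["W1", "W2"]
moved_w2 = commit_at_secondary(committed, tentative, "W2", 2)   # jumps ahead of W1
moved_w1 = commit_at_secondary(committed, tentative, "W1", 3)   # position unchanged
```

A `True` result corresponds to the case in which the server must roll back at 74 and re-execute at 63; a `False` result means no further action is required.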

Database "roll back" and "roll-forward" procedures
are well known tools to database system architects.
Nevertheless, in the interest of completeness, a suitable
roll back procedure is shown in Fig. 12. As shown, the
procedure is initialized (1) by setting a position index, p,
to the positional location in the write log of the write
record to which it is desired to roll back, as at 75, and
(2) by setting a pointer k and a threshold count n to the
total number of write records in the write log, as at 76.
An iterative undo process is then run on the database,
as at 77, to undo the effects on the database of one after
another of the most recent writes while decrementing
the pointer index k at 78 after the effect of each of those
writes is undone and checking to determine at 79
whether there are any additional writes that still need to be
undone. (The undo of a write that has not been applied
to the database does nothing, and writes can be undone
in an order different than they were applied to the
database.) This process 77-79 continues until it is
determined at 79 that the pointer index k is pointing to the
same position in the write log as the position index p.
When that occurs, the write at which the pointer k is then
pointing is executed as at 63 (Fig. 3). If it is determined
at 81 that the pointer k is pointing at any write record
other than one at the young end of the write log 60, the
pointer k is incremented at 82 to cause the next write in
order toward the young end of the log to be re-executed
at 63. Further iterations of this write re-execution
procedure 63, 81, 82 are carried out on the next following
write instructions, until it is determined at 81 that the pointer k
is pointing at the young end of the write log 60 (Fig. 6).
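
A roll back procedure of this general shape might be sketched as follows. The undo records, the `MISSING` sentinel, and the dictionary database are assumptions made for the example. The indices are zero-based, so `k` starts one past the youngest write rather than at the write count used in the figure.

```python
MISSING = object()   # marks "the key did not exist before this write"

def execute(db, undo, write):
    """Apply a write and remember the value it overwrote."""
    write_id, name, value = write
    undo[write_id] = db.get(name, MISSING)
    db[name] = value

def undo_write(db, undo, write):
    """Reverse a write; undoing a write that was never applied does nothing."""
    write_id, name, _value = write
    if write_id not in undo:
        return
    old = undo.pop(write_id)
    if old is MISSING:
        del db[name]
    else:
        db[name] = old

def roll_back_and_redo(db, undo, log, p):
    k = len(log)              # 76: start from the young end of the log
    while k > p:              # 77-79: undo youngest-first until k reaches p
        k -= 1
        undo_write(db, undo, log[k])
    while k < len(log):       # 63, 81, 82: re-execute toward the young end
        execute(db, undo, log[k])
        k += 1

db, undo = {}, {}
log = [("W1", "x", 1), ("W3", "x", 3)]
for w in log:
    execute(db, undo, w)
log.insert(1, ("W2", "x", 2))            # a new write sorts into position 1
roll_back_and_redo(db, undo, log, p=1)
```

After the call, `W2` and `W3` have been executed in log order on top of the state left by `W1`, and the undo records reflect the values each of them overwrote.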

Fig. 12 illustrates a scenario of the type that causes
write re-ordering, and Figs. 13 and 14 track the scenario
of Fig. 12 to show when the servers receive the writes
and the current logs containing writes (1) in a tentative
state (italicized characters) and (2) in a committed state
(boldface characters). To simplify the presentation, the
scenario assumes that each of the servers S1 - Sn
initially holds a single committed write, W0. Server Sn has
been designated as being the primary server, so it is
solely responsible for committing the writes W1 and W2.

As will be recalled, committed writes are ordered in
accordance with their commit sequence numbers
(CSNs). Tentative writes, on the other hand, are
ordered first by timestamps that indicate when they were
initially received and secondly by the IDs of the servers
by which they were initially received. Both the
timestamps and server IDs are included in the write ID. Server IDs are
used as a secondary sort key for disambiguating the
ordering of tentative writes that have identical timestamp
values.

F. Reading of Tentative Data by Clients and 
Disconnected Groups 

Clients that issue writes generally want to see these 
updates reflected in their subsequent read requests to 
the database. Further, some of these clients may even 
issue writes that depend on reading their previous 
writes. This is likely to be true, even if the client is dis- 
connected from the primary server such that the up- 
dates cannot be immediately committed. At any rate, to 
the extent possible, clients should be unaware that their 
updates are tentative and should see no change when 
the updates later commit; that is, the tentative results 
should equal the committed results whenever possible. 

When two secondary servers exchange tentative 
writes using anti-entropy, they agree on a tentative
ordering for these writes. As will be recalled, order is
based in the first instance on timestamps assigned to 
each write by the server that first accepted it so that any 
two servers with identical sets of writes with different 
timestamps will order them identically. Thus, a group of 
servers that are disconnected from the primary will 
reach agreement among themselves on how to order 
writes and resolve internal conflicts. This write ordering 
is only tentative in that it may differ from the order that 
the primary server uses to commit the writes. However, 
in the case where no clients outside the disconnected 
group perform conflicting updates, the writes can and 
will eventually be committed by the primary server in the 
tentative order and produce the same effect on the com- 
mitted database as they had on the tentative one. 
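
The agreement reached by a disconnected group can be illustrated with a short sketch. The write IDs below are invented for the example, each being a `(timestamp, server ID)` pair, and `receive` is a hypothetical helper rather than anything named in the figures.

```python
import bisect

def receive(log, write_id):
    """Insert a tentative write into the log in sorted order, ignoring repeats."""
    if write_id not in log:
        bisect.insort(log, write_id)

# Two of these writes share a timestamp, so the server ID breaks the tie.
writes = [(5, "S2"), (5, "S1"), (3, "S3")]

log_a, log_b = [], []
for w in writes:               # server A learns of the writes in one order
    receive(log_a, w)
for w in reversed(writes):     # server B learns of them in the opposite order
    receive(log_b, w)
```

Both logs end up in the same tentative order regardless of arrival order, which is why the group can resolve conflicts consistently without contacting the primary server.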

Conclusions 

As will be appreciated, the architecture that has 
been provided supports shared databases that can be 
read and updated by users who may be disconnected 
from other users, either individually or as a group. Cer- 
tain of the features of this architecture can be used in 
other systems that may have similar or different require- 
ments. For example, the application specific conflict de- 
tection that is described herein might be used in systems 
that rely upon manual resolution of the detected con- 
flicts. Similarly, the application specific conflict resolu- 
tion methodology might be employed in systems that uti- 
lize version vectors for conflict detection. 

Briefly, the steps in processing a write operation can
be summarized in somewhat simplified terms as follows:

0. Receive write operation from user or from another
server.

1. If from user, then assign unique identifier (ID) to
write of form <server ID, timestamp> and assign
commit sequence number (CSN) = INFINITY. A
CSN value of infinity indicates that the write is
tentative.

2. If primary server, then assign commit sequence
number = last assigned CSN + 1.

3. Insert write into server's write log such that all
writes in the log are ordered first by CSN, then by
timestamp, and finally by server ID.

4. If write was previously in log at time it is entered
into commit portion of log, then delete the prior
instance to produce new write log.

5. If write not at the end of the write log, then roll back
the server's database to the point just before the
new write.

6. For each write in the log from the new write to the
tail of the log, do

   6.1 Run the dependency query over the database
   and get the results.

   6.2 If the results do not equal the expected
   results then go to step 6.5.

   6.3 Perform the expected update on the database.

   6.4 Skip the next steps and go back to step 6.

   6.5 Execute the mergeproc and get the revised
   update.

   6.6 Perform the revised update on the database.

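
The steps above can be sketched in Python under some simplifying assumptions. The `Write` record, the room-booking example, and the helper names are hypothetical. Step 2 (assignment of a CSN by the primary server) is omitted, and step 5 is modelled by replaying the whole log from the initial state, which gives the same result as rolling back to just before the new write and re-executing from there.

```python
import math
from dataclasses import dataclass
from typing import Callable

@dataclass
class Write:
    timestamp: int
    server_id: str
    dependency_query: Callable[[dict], object]
    expected_result: object
    update: Callable[[dict], None]
    mergeproc: Callable[[dict], Callable[[dict], None]]
    csn: float = math.inf      # step 1: infinity marks a tentative write

    def key(self):
        # Step 3: order by CSN, then timestamp, then server ID.
        return (self.csn, self.timestamp, self.server_id)

def execute(db, w):
    if w.dependency_query(db) == w.expected_result:   # steps 6.1 and 6.2
        w.update(db)                                  # step 6.3
    else:
        revised_update = w.mergeproc(db)              # step 6.5
        revised_update(db)                            # step 6.6

def process(log, w, initial_db):
    same_id = (w.timestamp, w.server_id)
    log[:] = [x for x in log if (x.timestamp, x.server_id) != same_id]  # step 4
    log.append(w)
    log.sort(key=Write.key)                           # step 3
    db = dict(initial_db)                             # step 5, as a full replay
    for x in log:                                     # step 6
        execute(db, x)
    return db

def book(room, who, alternative, ts, sid):
    """Hypothetical booking write: take `room`, else fall back to `alternative`."""
    def merge(db):
        if alternative not in db:
            return lambda d: d.__setitem__(alternative, who)
        return lambda d: None                         # no room free: no update
    return Write(ts, sid,
                 dependency_query=lambda db: db.get(room),
                 expected_result=None,
                 update=lambda db: db.__setitem__(room, who),
                 mergeproc=merge)

log = []
db = process(log, book("A", "alice", "B", ts=2, sid="S1"), {})
db = process(log, book("A", "bob", "B", ts=1, sid="S2"), {})
```

In this example the second write carries an earlier timestamp, so it sorts ahead of the first. On re-execution the first write's dependency check fails and its merge procedure books the alternative room instead.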

To utilize just the application-specific detection of
conflicting writes portion of the process as summarized
above,

- Eliminate step 2 of the process as summarized
above, and
- Replace steps 6.5 and 6.6 with a new step "6.5
Abort write operation and/or report conflict to user".

Or, to employ the application-specific resolution of
detected write conflicts by itself or in some other
process,

- Eliminate step 2, and
- Replace steps 6.1 and 6.2 with a new step "6.1 If
conflict detected by comparing version vectors or
some other method then go to step 6.5".

Lastly, to use just the notion of maintaining two
classes of data, committed and tentative,

- Eliminate steps 6.1 and 6.2, and
- Eliminate steps 6.4, 6.5 and 6.6.

Claims

1. An application-specific process for resolving
conflicts that are found to exist between an instance of
a database and write operations that are presented
for updating said instance of said database, said
process comprising:

associating an arbitrary merge-procedure,
specified by said application, with each of said
write operations, where the effect of executing
the merge-procedure is deterministic for any
given state of said instance of said database at
any time that said merge-procedure is
executed; and
executing the merge-procedure associated
with a given write operation whenever it is found
that the given write operation conflicts with said
instance of said database, said merge-procedure
producing a set of updates that are applied
to said database in lieu of any updates
originally contemplated by the given write operation.